{
 "cells": [
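  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# User data processing\n",
    "(Only user IDs that appear in the training and test sets are kept.)\n",
    "\n",
    "The data comes from the Kaggle competition Event Recommendation Engine Challenge. The task is to predict whether a user will be interested in a given event, based on:\n",
    "- events they’ve responded to in the past\n",
    "- user demographic information\n",
    "- what events they’ve seen and clicked on in our app\n",
    "\n",
    "Competition data page:\n",
    "https://www.kaggle.com/c/event-recommendation-engine-challenge/data\n",
    "\n",
    "User profile information is stored in users.csv, which has 7 columns:\n",
    "- user_id\n",
    "- locale: region and language\n",
    "- birthyear: year of birth\n",
    "- gender: gender\n",
    "- joinedAt: time the user joined the app, ISO-8601 UTC time\n",
    "- location: location\n",
    "- timezone: time zone"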
   ]
  },
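  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of the joinedAt format listed above, an ISO-8601 UTC timestamp can be reduced to a year-month value with the standard library. This is a minimal sketch, not the FeatureEng implementation used later in this notebook; the helper name joined_year_month and the YYYYMM output format are assumptions.\n",
    "\n",
    "```python\n",
    "from datetime import datetime\n",
    "\n",
    "def joined_year_month(joined_at):\n",
    "    # Hypothetical helper: parse e.g. '2012-10-02T06:40:55.524Z' and return 'YYYYMM'.\n",
    "    # Missing values (NaN in users.csv) are not strings, so they map to None here.\n",
    "    if not isinstance(joined_at, str):\n",
    "        return None\n",
    "    dt = datetime.strptime(joined_at, '%Y-%m-%dT%H:%M:%S.%fZ')\n",
    "    return dt.strftime('%Y%m')\n",
    "\n",
    "print(joined_year_month('2012-10-02T06:40:55.524Z'))  # 201210\n",
    "```"
   ]
  },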
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-01-20T08:02:58.122225Z",
     "start_time": "2018-01-20T08:02:56.723721Z"
    },
    "collapsed": true
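   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import numpy as np\n",
    "import scipy.sparse as ss\n",
    "import scipy.io as sio\n",
    "\n",
    "# Serialization of intermediate results\n",
    "try:\n",
    "    import cPickle  # Python 2\n",
    "except ImportError:\n",
    "    import pickle as cPickle  # Python 3 fallback; the cPickle name is kept for the cells below\n",
    "\n",
    "# Feature-encoding helper from the local utils module\n",
    "from utils import FeatureEng\n",
    "from sklearn.preprocessing import normalize\n",
    "# Similarity / distance functions\n",
    "import scipy.spatial.distance as ssd"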
   ]
  },
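  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "users.csv contains more users than appear in the training and test sets.\n",
    "To save processing time and memory, train and test are processed first to find the events and users the competition actually needs,\n",
    "and a new ID index is then built for the events and users that appear in the training or test set.\n",
    "\n",
    "Run user_event.ipynb first\n",
    "to produce the user index file: PE_userIndex.pkl"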
   ]
  },
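  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The re-indexing step described above could look roughly like the following. It is only a sketch: the actual logic lives in user_event.ipynb, and the column name user, the toy data frames and the function name build_user_index are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def build_user_index(*frames):\n",
    "    # Hypothetical helper: map each distinct user ID (as a string) to a compact row index.\n",
    "    user_ids = set()\n",
    "    for df in frames:\n",
    "        user_ids.update(str(u) for u in df['user'])  # 'user' column name is an assumption\n",
    "    return {uid: idx for idx, uid in enumerate(sorted(user_ids))}\n",
    "\n",
    "# Toy stand-ins for the train and test tables (made-up values)\n",
    "train = pd.DataFrame({'user': [42, 7, 42]})\n",
    "test = pd.DataFrame({'user': [7, 99]})\n",
    "\n",
    "user_index = build_user_index(train, test)\n",
    "print(len(user_index))  # 3 distinct users\n",
    "```\n",
    "\n",
    "String keys are used here because the encoding loop later in this notebook looks users up with str(user_id)."
   ]
  },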
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the previously computed users that appear in the training and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-01-20T08:02:58.145732Z",
     "start_time": "2018-01-20T08:02:58.125796Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of users in train & test :3391\n"
     ]
    }
   ],
   "source": [
    "# Load the index of users that appear in the training and test sets\n",
    "with open(\"PE_userIndex.pkl\", 'rb') as f:\n",
    "    userIndex = cPickle.load(f)\n",
    "n_users = len(userIndex)\n",
    "\n",
    "print(\"number of users in train & test :%d\" % n_users)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process users.csv --> feature encoding and user-to-user similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-01-20T08:02:58.257287Z",
     "start_time": "2018-01-20T08:02:58.149192Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>locale</th>\n",
       "      <th>birthyear</th>\n",
       "      <th>gender</th>\n",
       "      <th>joinedAt</th>\n",
       "      <th>location</th>\n",
       "      <th>timezone</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3197468391</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1993</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-10-02T06:40:55.524Z</td>\n",
       "      <td>Medan  Indonesia</td>\n",
       "      <td>480.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3537982273</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1992</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-09-29T18:03:12.111Z</td>\n",
       "      <td>Medan  Indonesia</td>\n",
       "      <td>420.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>823183725</td>\n",
       "      <td>en_US</td>\n",
       "      <td>1975</td>\n",
       "      <td>male</td>\n",
       "      <td>2012-10-06T03:14:07.149Z</td>\n",
       "      <td>Stratford  Ontario</td>\n",
       "      <td>-240.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1872223848</td>\n",
       "      <td>en_US</td>\n",
       "      <td>1991</td>\n",
       "      <td>female</td>\n",
       "      <td>2012-11-04T08:59:43.783Z</td>\n",
       "      <td>Tehran  Iran</td>\n",
       "      <td>210.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3429017717</td>\n",
       "      <td>id_ID</td>\n",
       "      <td>1995</td>\n",
       "      <td>female</td>\n",
       "      <td>2012-09-10T16:06:53.132Z</td>\n",
       "      <td>NaN</td>\n",
       "      <td>420.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id locale birthyear  gender                  joinedAt  \\\n",
       "0  3197468391  id_ID      1993    male  2012-10-02T06:40:55.524Z   \n",
       "1  3537982273  id_ID      1992    male  2012-09-29T18:03:12.111Z   \n",
       "2   823183725  en_US      1975    male  2012-10-06T03:14:07.149Z   \n",
       "3  1872223848  en_US      1991  female  2012-11-04T08:59:43.783Z   \n",
       "4  3429017717  id_ID      1995  female  2012-09-10T16:06:53.132Z   \n",
       "\n",
       "             location  timezone  \n",
       "0    Medan  Indonesia     480.0  \n",
       "1    Medan  Indonesia     420.0  \n",
       "2  Stratford  Ontario    -240.0  \n",
       "3        Tehran  Iran     210.0  \n",
       "4                 NaN     420.0  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the data\n",
    "users = pd.read_csv(\"users.csv\")\n",
    "users.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-01-20T08:02:58.285702Z",
     "start_time": "2018-01-20T08:02:58.261413Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 38209 entries, 0 to 38208\n",
      "Data columns (total 7 columns):\n",
      "user_id      38209 non-null int64\n",
      "locale       38209 non-null object\n",
      "birthyear    38209 non-null object\n",
      "gender       38100 non-null object\n",
      "joinedAt     38152 non-null object\n",
      "location     32745 non-null object\n",
      "timezone     37773 non-null float64\n",
      "dtypes: float64(1), int64(1), object(5)\n",
      "memory usage: 2.0+ MB\n"
     ]
    }
   ],
   "source": [
    "users.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-01-20T08:03:51.795065Z",
     "start_time": "2018-01-20T08:02:58.291393Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "FE = FeatureEng()\n",
    "\n",
    "# Columns: locale, birthyear, gender, joinedAt, location, timezone\n",
    "# Exclude the user_id column\n",
    "n_cols = users.shape[1] - 1\n",
    "cols = ['LocaleId', 'BirthYearInt', 'GenderId', 'JoinedYearMonth', 'CountryId', 'TimezoneInt']\n",
    "\n",
    "# Encoded user features\n",
    "#userMatrix = np.zeros((n_users, n_cols), dtype=np.int)\n",
    "userMatrix = ss.dok_matrix((n_users, n_cols))\n",
    "print(userMatrix)\n",
    "\n",
    "for u in range(users.shape[0]):\n",
    "    userId = str(users.loc[u,'user_id'])\n",
    "\n",
    "    if userId in userIndex:  # user appears in the training or test set\n",
    "        i = userIndex[userId]\n",
    "\n",
    "        userMatrix[i, 0] = FE.getLocaleId(users.loc[u,'locale'])\n",
    "        userMatrix[i, 1] = FE.getBirthYearInt(users.loc[u,'birthyear'])\n",
    "        userMatrix[i, 2] = FE.getGenderId(users.loc[u,'gender'])\n",
    "        userMatrix[i, 3] = FE.getJoinedYearMonth(users.loc[u,'joinedAt'])\n",
    "\n",
    "        # Location strings are not written consistently, so this encoding appears\n",
    "        # to have no effect (every sample ends up encoded as 0)\n",
    "        userMatrix[i, 4] = FE.getCountryId(users.loc[u,'location'])\n",
    "\n",
    "        userMatrix[i, 5] = FE.getTimezoneInt(users.loc[u,'timezone'])\n",
    "\n",
    "# Normalize the user matrix (L2 norm per feature column, since axis=0)\n",
    "userMatrix = normalize(userMatrix, norm=\"l2\", axis=0, copy=False)\n",
    "sio.mmwrite(\"US_userMatrix\", userMatrix)\n",
    "\n",
    "\n",
    "# Compute the user similarity matrix, used later by the recommender\n",
    "userSimMatrix = ss.dok_matrix((n_users, n_users))\n",
    "\n",
    "# Load the user pairs that appear in the training and test sets\n",
    "with open(\"FE_uniqueUserPairs.pkl\", 'rb') as f:\n",
    "    uniqueUserPairs = cPickle.load(f)\n",
    "\n",
    "# Diagonal entries: each user is fully similar to itself\n",
    "for i in range(0, n_users):\n",
    "    userSimMatrix[i, i] = 1.0\n",
    "\n",
    "# Fill both (i, j) and (j, i) so the matrix is symmetric\n",
    "for u1, u2 in uniqueUserPairs:\n",
    "    #i = userIndex[u1]\n",
    "    #j = userIndex[u2]\n",
    "    i = u1\n",
    "    j = u2\n",
    "    if (i, j) not in userSimMatrix:\n",
    "        # Pearson correlation is used as the similarity measure.\n",
    "        # Note: ssd.correlation returns the correlation distance, 1 - r, not r itself.\n",
    "        # Features: country (locale, location), age, gender, time zone, location\n",
    "        #usim = ssd.correlation(userMatrix[i,:],\n",
    "            #userMatrix[j,:])\n",
    "\n",
    "        usim = ssd.correlation(userMatrix.getrow(i).todense(),\n",
    "          userMatrix.getrow(j).todense())\n",
    "        userSimMatrix[i, j] = usim\n",
    "        userSimMatrix[j, i] = usim\n",
    "\n",
    "sio.mmwrite(\"US_userSimMatrix\", userSimMatrix)"
   ]
  },
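  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the values stored above: scipy.spatial.distance.correlation returns the correlation distance 1 - r, where r is the Pearson coefficient, so two perfectly correlated profiles give 0 rather than 1. The sketch below uses made-up vectors to check this against np.corrcoef and shows one possible conversion to a similarity. Whether the downstream notebooks expect a distance or a similarity is not stated here, so the conversion is a suggestion only.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import scipy.spatial.distance as ssd\n",
    "\n",
    "# Two made-up feature vectors (not taken from users.csv)\n",
    "a = np.array([1.0, 2.0, 3.0, 4.0])\n",
    "b = np.array([2.0, 4.0, 6.5, 8.0])\n",
    "\n",
    "dist = ssd.correlation(a, b)   # correlation distance, 1 - r\n",
    "r = np.corrcoef(a, b)[0, 1]    # Pearson correlation coefficient r\n",
    "\n",
    "# The two quantities agree up to floating-point rounding\n",
    "print(np.isclose(dist, 1.0 - r))\n",
    "\n",
    "# One way to turn the distance back into a similarity in [-1, 1]\n",
    "sim = 1.0 - dist\n",
    "print(sim)\n",
    "```"
   ]
  },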
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
